
Conversation

slachiewicz
Member

@slachiewicz slachiewicz commented Oct 10, 2025

…match in stale data cache

Problem

Starting in version 3.11.3, running mvn javadoc:javadoc twice in succession on Windows with Java 8 fails on the second run with:

java.nio.charset.MalformedInputException: Input length = 1
    at org.apache.maven.plugins.javadoc.AbstractJavadocMojo.isUpToDate(AbstractJavadocMojo.java:5008)

This regression affects Windows users running Java 8, where the default platform encoding is Cp1252.

Root Cause

The issue stems from a charset mismatch in how the stale data cache file is written versus how it's read:

  1. Write operation (StaleHelper.writeStaleData(), line 126):

    • Uses getDataCharset() which returns Charset.defaultCharset() for Java 8
    • On Windows, this is Cp1252
  2. Read operation (AbstractJavadocMojo.isUpToDate(), line 5008):

    • Always uses hardcoded StandardCharsets.UTF_8

When the second run attempts to read a file written with Cp1252 encoding using UTF-8, the non-ASCII bytes (which generally do not form valid UTF-8 sequences) cause a MalformedInputException.
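The failure mode can be reproduced in isolation with a short sketch (editorial illustration, not plugin code; windows-1252 stands in for the Cp1252 default on Windows, and the names here are made up):

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class CharsetMismatchDemo {
    // Returns true if bytes written in one charset fail to decode as UTF-8,
    // which is exactly what happens on the second javadoc run.
    static boolean failsAsUtf8(byte[] data) throws IOException {
        Path cache = Files.createTempFile("stale-data", ".txt");
        try {
            Files.write(cache, data);
            // Files.readAllLines uses a strict decoder: malformed input throws
            // MalformedInputException instead of silently substituting chars.
            Files.readAllLines(cache, StandardCharsets.UTF_8);
            return false;
        } catch (MalformedInputException e) {
            return true;
        } finally {
            Files.delete(cache);
        }
    }

    public static void main(String[] args) throws IOException {
        // A path containing a non-ASCII character, encoded as Cp1252:
        // '\u00FC' ('ü') becomes the single byte 0xFC, which is invalid UTF-8.
        byte[] cp1252 = "C:\\projekt\\B\u00FCro\\pom.xml"
                .getBytes(Charset.forName("windows-1252"));
        System.out.println("fails as UTF-8: " + failsAsUtf8(cp1252));
    }
}
```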

Solution

Changed StaleHelper to always save in StandardCharsets.UTF_8 instead of the platform-dependent default charset. This ensures:

  • Consistent encoding across all platforms (Windows, macOS, Linux)
  • Consistent encoding across all Java versions (8, 11, 17, 21, etc.)
  • Write and read operations use the same charset
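The resulting symmetric round trip can be sketched as follows (an editorial sketch with simplified method shapes, not the plugin's exact signatures):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class StaleDataRoundTrip {
    // Writer side: always UTF-8, regardless of platform or Java version
    // (the change made in StaleHelper).
    static void writeStaleData(Path file, List<String> lines) throws IOException {
        Files.write(file, lines, StandardCharsets.UTF_8);
    }

    // Reader side: unchanged, already hardcoded to UTF-8
    // (as in AbstractJavadocMojo.isUpToDate).
    static List<String> readStaleData(Path file) throws IOException {
        return Files.readAllLines(file, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("stale", ".txt");
        // Non-ASCII arguments now survive the round trip on every platform.
        List<String> data = Arrays.asList("-d", "C:\\projekt\\B\u00FCro\\apidocs", "\u4E2D\u6587.java");
        writeStaleData(f, data);
        System.out.println(readStaleData(f).equals(data));
        Files.delete(f);
    }
}
```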

Fixes: #1273 #1264

@slachiewicz slachiewicz added the bug Something isn't working label Oct 10, 2025
…match in stale data cache

Changed StaleHelper to always save in UTF-8 instead of
platform-dependent default charset. This ensures consistency with
AbstractJavadocMojo.isUpToDate() which reads the stale data file using UTF-8.

Previously on Windows with Java 8:
- First run: file written with Cp1252 (default charset)
- Second run: file read with UTF-8, causing MalformedInputException
@slachiewicz slachiewicz merged commit b453602 into master Oct 15, 2025
112 of 131 checks passed
@slachiewicz slachiewicz deleted the windowscharset branch October 15, 2025 22:38
@github-actions github-actions bot added this to the 3.12.0 milestone Oct 15, 2025
@fridrich
Contributor

This is basically reverting 33c9f01, which was actually fixing a real problem of inconsistency.

@gnodet
Contributor

gnodet commented Oct 17, 2025

This is basically reverting 33c9f01, which was actually fixing a real problem of inconsistency.

You mean this is reverting https://issues.apache.org/jira/browse/MJAVADOC-614 ?

@fridrich
Contributor

Commit d2dd532 brought an inconsistency that I saw with a project containing Chinese characters, and I made it consistent again with 33c9f01.

I assume this bug comes from something related to the way we determine the charset in that getDataCharset function. The decision may need to depend not only on the Java version but also on the OS/arch.

I will test this fix. If it does not bring the previous bug back, we can just leave it as it is, but if it regresses, we should look for a proper fix.

@gnodet
Contributor

gnodet commented Oct 17, 2025

Commit d2dd532 brought an inconsistency that I saw with a project containing Chinese characters, and I made it consistent again with 33c9f01.

I assume this bug comes from something related to the way we determine the charset in that getDataCharset function. The decision may need to depend not only on the Java version but also on the OS/arch.

I will test this fix. If it does not bring the previous bug back, we can just leave it as it is, but if it regresses, we should look for a proper fix.

@slachiewicz have a look at https://issues.apache.org/jira/browse/MJAVADOC-614; it mentions the intent behind the JDK check, and the fact that @-files use UTF-8 on JDK 9..12 and Charset.defaultCharset() on others.

@fridrich
Contributor

Actually, as I look at it more and more and as I understand it better, I think this particular fix is the right one.

@gnodet
Contributor

gnodet commented Oct 17, 2025

Actually, as I look at it more and more and as I understand it better, I think this particular fix is the right one.

But it won't work well on JDK 9, 10 and 11 as I raised above.

@fridrich
Contributor

fridrich commented Oct 17, 2025

I mean, I ran the ITs with OpenJDK 8, 11, and 17 and they passed.

My only concern is that the mismatch is between what one gets from plexus-utils as a path string (with its encoding) and how one then reads the line. As I understand it now, for Java 8 it boils down to the filesystem encoding itself, because com.sun.tools.javac.main.CommandLine.loadCmdFile in Java 8 does not specify any encoding at all, and going down the code it looks like the encoding is always assumed (or ignored?).
For Java 9-12 (verified on 11), the function calls Reader r = Files.newBufferedReader(Paths.get(name)), which assumes UTF_8.INSTANCE. That was changed to Reader r = Files.newBufferedReader(Paths.get(name), Charset.defaultCharset()) for 13+. Now, can Charset.defaultCharset() diverge from the filesystem encoding? Maybe, and that would actually be the problem we have. If Charset.defaultCharset() governed the file content encoding, that would be a possible mismatch: on Linux, for example, you can have Java 8 running on a filesystem that is UTF-8.
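The per-JDK behavior described above can be summarized in code (an editorial illustration of the javac behavior under discussion, not actual plugin code):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class AtFileCharset {
    // Which charset javac's CommandLine effectively uses when reading @-files,
    // per the observations above: no explicit charset on 8 (platform default),
    // UTF-8 on 9-12, Charset.defaultCharset() explicitly on 13+.
    static Charset effectiveCharset(int javaFeatureVersion) {
        if (javaFeatureVersion >= 9 && javaFeatureVersion <= 12) {
            return StandardCharsets.UTF_8;
        }
        return Charset.defaultCharset();
    }
}
```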

@gnodet
Contributor

gnodet commented Oct 17, 2025

I mean, I ran the ITs with OpenJDK 8, 11, and 17 and they passed.

On Windows, where the default encoding is not UTF-8?

I think the problem is that javadoc expects a certain encoding, so we have no choice but to actually use that one. If we always write in UTF-8, I don't see how that would work.

@fridrich
Contributor

I mean, I ran the ITs with OpenJDK 8, 11, and 17 and they passed.

On Windows?

Indeed, no

@fridrich
Contributor

I think the problem is that javadoc expects a certain encoding, so we have no choice but to actually use that one. If we always write in UTF-8, I don't see how that would work.

So maybe it would be better to cross-check where the original problem lies. I had a thought: if, while reading in a try/catch block, we recorded the charset we actually used for reading and then wrote using that same one, it could work. But I am not sure, because the exception is not necessarily thrown when the file is valid UTF-8 but is actually in another encoding.
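That try/catch idea might look like this (an editorial sketch; the fallback charset is a parameter here, and the caveat stands: a successful UTF-8 read does not prove the file was written as UTF-8):

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class FallbackRead {
    // Try UTF-8 first; if decoding fails, retry with the given fallback.
    // Caveat from the discussion: bytes in another encoding can still happen
    // to be valid UTF-8, so the exception is not a reliable signal.
    static List<String> readWithFallback(Path file, Charset fallback) throws IOException {
        try {
            return Files.readAllLines(file, StandardCharsets.UTF_8);
        } catch (MalformedInputException e) {
            return Files.readAllLines(file, fallback);
        }
    }
}
```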

@fridrich
Contributor

fridrich commented Oct 17, 2025

Or we extract getDataCharset() into SystemUtils or similar, make it public, and use it everywhere we need the charset.
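A hypothetical shape for that shared helper (names and placement are illustrative, not the actual extraction):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public final class CharsetUtils {
    private CharsetUtils() {}

    // One public decision point, called by both the writer
    // (StaleHelper.writeStaleData) and the reader (AbstractJavadocMojo.isUpToDate),
    // so the two sides can never disagree again. The version rule mirrors the
    // @-file behavior discussed above: UTF-8 on JDK 9-12, default elsewhere.
    public static Charset getDataCharset() {
        int feature = featureOf(System.getProperty("java.specification.version"));
        return (feature >= 9 && feature <= 12) ? StandardCharsets.UTF_8 : Charset.defaultCharset();
    }

    // "1.8" -> 8, "11" -> 11, "17" -> 17
    static int featureOf(String spec) {
        return spec.startsWith("1.") ? Integer.parseInt(spec.substring(2)) : Integer.parseInt(spec);
    }
}
```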

@fridrich
Contributor

Let me craft something.

fridrich added a commit to fridrich/maven-javadoc-plugin that referenced this pull request Oct 17, 2025
@fridrich
Contributor

Something like #1278


Labels

bug Something isn't working hacktoberfest-accepted


Development

Successfully merging this pull request may close these issues.

Regression starting in 3.11.3: java.nio.charset.MalformedInputException: Input length = 1
